We study the effective degrees of freedom of the lasso in the
framework of Stein's unbiased risk estimation (SURE). We show that
the number of nonzero coefficients is an unbiased estimate for the
degrees of freedom of the lasso, a conclusion that requires no special
assumption on the predictors. In addition, the unbiased estimator is
shown to be asymptotically consistent. With these results on hand,
various model selection criteria (Cp, AIC and BIC) are available,
which, along with the LARS algorithm, provide a principled and
efficient approach to obtaining the optimal lasso fit with the
computational effort of a single ordinary least-squares fit.

1. Introduction. The lasso is a popular model building technique that
simultaneously produces accurate and parsimonious models (Tibshirani [22]).
Suppose $y = (y_1, \ldots, y_n)^T$ is the response vector and
$x_j = (x_{1j}, \ldots, x_{nj})^T$, $j = 1, \ldots, p$, are the linearly
independent predictors. Let $X = [x_1, \ldots, x_p]$ be the predictor matrix.
Assume the data are standardized. The lasso estimates for the coefficients
of a linear model are obtained by

(1.1) $\hat\beta = \arg\min_\beta \Big\| y - \sum_{j=1}^p x_j \beta_j \Big\|^2 + \lambda \sum_{j=1}^p |\beta_j|,$

where $\lambda$ is called the lasso regularization parameter. What we show in
this paper is that the number of nonzero components of $\hat\beta$ is an exact
unbiased estimate of the degrees of freedom of the lasso, and this result can
be used to construct adaptive model selection criteria for efficiently
selecting the optimal lasso fit.

Degrees of freedom is a familiar phrase for many statisticians. In linear 
regression the degrees of freedom is the number of estimated predictors. 
Degrees of freedom is often used to quantify the model complexity of a 
statistical modeling procedure (Hastie and Tibshirani [10]). However,
generally speaking, there is no exact correspondence between the degrees of
freedom and the number of parameters in the model (Ye [24]). For example,
suppose we first find $x_{j^*}$ such that $|\mathrm{cor}(x_{j^*}, y)|$ is the
largest among all $x_j$, $j = 1, 2, \ldots, p$. We then use $x_{j^*}$ to fit a
simple linear regression model to predict y. There is one parameter in the
fitted model, but the degrees of freedom is greater than one, because we
have to take into account the stochastic search of $x_{j^*}$.
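
The effect of the stochastic search can be checked numerically. The sketch
below is a toy setting of our own, not the paper's: the predictors are assumed
to be the coordinate vectors e_1, ..., e_p, the true mean is zero and sigma = 1,
so that x_j^T y = y_j and the selected fit has a single nonzero entry. The
function name `best_single_predictor_df` is hypothetical.

```python
import random

def best_single_predictor_df(p=5, n_trials=20000, seed=0):
    """Monte Carlo estimate of df = sum_i cov(mu_hat_i, y_i) / sigma^2.

    Assumed toy setting: predictors are e_1..e_p, true mean 0, sigma = 1,
    so the fit on the selected predictor j* is mu_hat = y[j*] * e_{j*}.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        y = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Stochastic search: choose the predictor with the largest |x_j^T y|.
        j_star = max(range(p), key=lambda j: abs(y[j]))
        # With true mean 0, sum_i mu_hat_i * y_i = y[j*]^2 is an unbiased
        # draw of sum_i cov(mu_hat_i, y_i).
        total += y[j_star] ** 2
    return total / n_trials

df_searched = best_single_predictor_df(p=5)   # search over five predictors
df_fixed = best_single_predictor_df(p=1)      # no search: one fixed predictor
print(df_searched, df_fixed)
```

With a single fixed predictor the estimate stays close to one, while searching
over five candidates pushes it well above one, even though each fitted model
has one parameter.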

Stein's unbiased risk estimation (SURE) theory (Stein [21]) gives a
rigorous definition of the degrees of freedom for any fitting procedure.
Given a model fitting method $\delta$, let $\hat\mu = \delta(y)$ represent its
fit. We assume that given the x's, y is generated according to
$y \sim (\mu, \sigma^2 I)$, where $\mu$ is the true mean vector and $\sigma^2$
is the common variance. It is shown (Efron [4]) that the degrees of freedom
of $\delta$ is

(1.2) $df(\hat\mu) = \sum_{i=1}^n \mathrm{cov}(\hat\mu_i, y_i)/\sigma^2.$

For example, if $\delta$ is a linear smoother, that is, $\hat\mu = Sy$ for some
matrix S independent of y, then we have $\mathrm{cov}(\hat\mu, y) = \sigma^2 S$,
$df(\hat\mu) = \mathrm{tr}(S)$. SURE theory also reveals the statistical
importance of the degrees of freedom. With df defined in (1.2), we can employ
the covariance penalty method to construct a Cp-type statistic as

(1.3) $C_p(\hat\mu) = \frac{\|y - \hat\mu\|^2}{n} + \frac{2\,df(\hat\mu)}{n}\sigma^2.$

Efron [4] showed that Cp is an unbiased estimator of the true prediction 
error, and in some settings it offers substantially better accuracy than cross- 
validation and related nonparametric methods. Thus degrees of freedom 
plays an important role in model assessment and selection. Donoho and 
Johnstone [3] used the SURE theory to derive the degrees of freedom of 
soft thresholding and showed that it leads to an adaptive wavelet shrinkage 
procedure called SureShrink. Ye [24] and Shen and Ye [20] showed that 
the degrees of freedom can capture the inherent uncertainty in modeling 
and frequentist model selection. Shen and Ye [20] and Shen, Huang and 
Ye [19] further proved that the degrees of freedom provides an adaptive 
model selection criterion that performs better than the fixed-penalty model 
selection criteria. 
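
As a concrete instance of a linear smoother, the sketch below computes
df = tr(S) and a Cp-type statistic of the form "residual sum of squares over n
plus 2 df sigma^2 over n". The data, the ridge-type smoother and the helper
names `smoother_matrix` and `cp_statistic` are illustrative assumptions, not
taken from the paper, and sigma^2 is treated as known.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 4, 1.0
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + sigma * rng.standard_normal(n)

def smoother_matrix(X, alpha):
    """S = X (X^T X + alpha I)^{-1} X^T; alpha = 0 gives the OLS hat matrix."""
    k = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T)

def cp_statistic(y, mu_hat, df, sigma2):
    """Cp-type statistic: ||y - mu_hat||^2 / n + 2 * df * sigma2 / n."""
    m = len(y)
    return float(np.sum((y - mu_hat) ** 2) / m + 2.0 * df * sigma2 / m)

for alpha in (0.0, 10.0):
    S = smoother_matrix(X, alpha)
    df = float(np.trace(S))          # degrees of freedom of the linear smoother
    print(alpha, df, cp_statistic(y, S @ y, df, sigma ** 2))
```

For alpha = 0 the trace equals the number of predictors; shrinkage (alpha > 0)
lowers it, which is the sense in which df measures model complexity here.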

The lasso is a regularization method which does automatic variable
selection. As shown in Figure 1 (the left panel), the lasso continuously
shrinks the coefficients toward zero as $\lambda$ increases; and some
coefficients are shrunk to exactly zero if $\lambda$ is sufficiently large.
Continuous shrinkage also often improves the prediction accuracy due to the
bias-variance trade-off. Detailed discussions on variable selection via
penalization are given in Fan and Li [6], Fan and Peng [8] and Fan and Li [7].
In recent years the lasso has attracted a lot of attention in both the
statistics and machine learning communities. It is of great interest to know
the degrees of freedom of the lasso for any given regularization parameter
$\lambda$ for selecting the optimal lasso model. However, it is difficult to
derive the analytical expression of the degrees of freedom of many nonlinear
modeling procedures, including the lasso. To overcome the analytical
difficulty, Ye [24] and Shen and Ye [20] proposed using a data-perturbation
technique to numerically compute an (approximately) unbiased estimate for
$df(\hat\mu)$ when the analytical form of $\hat\mu$ is unavailable. The
bootstrap (Efron [4]) can also be used to obtain an (approximately) unbiased
estimator of the degrees of freedom. This kind of approach, however, can be
computationally expensive. It is an interesting problem of both theoretical
and practical importance to derive rigorous analytical results on the degrees
of freedom of the lasso.

Fig. 1. Diabetes data with ten predictors. The left panel shows the lasso
coefficient estimates $\hat\beta_j$, $j = 1, 2, \ldots, 10$, for the diabetes
study. The lasso coefficient estimates are piece-wise linear functions of
$\lambda$ (Osborne, Presnell and Turlach [15] and Efron, Hastie, Johnstone and
Tibshirani [5]), hence they are piece-wise nonlinear as functions of
$\log(1 + \lambda)$. The right panel shows the curve of the proposed unbiased
estimate for the degrees of freedom of the lasso.

In this work we study the degrees of freedom of the lasso in the framework
of SURE. We show that for any given $\lambda$ the number of nonzero predictors
in the model is an unbiased estimate for the degrees of freedom. This is a
finite-sample exact result and the result holds as long as the predictor
matrix is a full rank matrix. The importance of the exact finite-sample
unbiasedness is emphasized in Efron [4], Shen and Ye [20] and Shen and
Huang [18]. We show that the unbiased estimator is also consistent. As an
illustration, the right panel in Figure 1 displays the unbiased estimate for
the degrees of freedom as a function of $\lambda$ for the diabetes data (with
ten predictors).

Fig. 2. The diabetes data: Cp and BIC curves with ten (top) and 64 (bottom)
predictors. In the top panel Cp and BIC select the same model with seven
nonzero coefficients. In the bottom panel, Cp selects a model with 15 nonzero
coefficients and BIC selects a model with 11 nonzero coefficients.

The unbiased estimate of the degrees of freedom can be used to construct 
Cp and BIC type model selection criteria. The Cp (or BIC) curve is easily 
obtained once the lasso solution paths are computed by the LARS algorithm 
(Efron, Hastie, Johnstone and Tibshirani [5]). Therefore, with the compu- 
tational effort of a single OLS fit, we are able to find the optimal lasso fit 
using our theoretical results. Note that Cp is a finite-sample result and re- 
lies on its unbiasedness for prediction error as a basis for model selection 
(Shen and Ye [20], Efron [4]). For this purpose, an unbiased estimate of the 
degrees of freedom is sufficient. We illustrate the use of Cp and BIC on the 
diabetes data in Figure 2, where the selected models are indicated by the 
broken vertical lines. 

The rest of the paper is organized as follows. We present the main results 
in Section 2. We construct model selection criteria — Cp or BIC — using the 
degrees of freedom. In Section 3 we discuss the conjecture raised in [5]. 
Section 4 contains some technical proofs. Discussion is in Section 5. 

2. Main results. We first define some notation. Let $\hat\mu_\lambda$ be the
lasso fit using the representation (1.1). $\hat\mu_i$ is the ith component of
$\hat\mu$. For convenience, we let $df(\lambda)$ stand for
$df(\hat\mu_\lambda)$, the degrees of freedom of the lasso. Suppose M is a
matrix with p columns. Let $\mathcal{S}$ be a subset of the indices
$\{1, 2, \ldots, p\}$. Denote by $M_{\mathcal{S}}$ the submatrix
$M_{\mathcal{S}} = [\cdots M_j \cdots]_{j \in \mathcal{S}}$, where $M_j$ is the
jth column of M. Similarly, define
$\beta_{\mathcal{S}} = (\cdots \beta_j \cdots)_{j \in \mathcal{S}}$ for any
vector $\beta$ of length p. Let Sgn(·) be the sign function: Sgn(x) = 1 if
x > 0; Sgn(x) = 0 if x = 0; Sgn(x) = -1 if x < 0. Let
$\mathcal{B} = \{j : \mathrm{Sgn}(\beta)_j \ne 0\}$ be the active set of
$\beta$, where $\mathrm{Sgn}(\beta)$ is the sign vector of $\beta$ given by
$\mathrm{Sgn}(\beta)_j = \mathrm{Sgn}(\beta_j)$. We denote the active set of
$\hat\beta(\lambda)$ as $\mathcal{B}(\lambda)$ and the corresponding sign vector
$\mathrm{Sgn}(\hat\beta(\lambda))$ as $\mathrm{Sgn}(\lambda)$. We do not
distinguish between the index of a predictor and the predictor itself.

2.1. The unbiased estimator of $df(\lambda)$. Before delving into the
technical details, let us review some characteristics of the lasso solution
(Efron et al. [5]). For a given response vector y, there is a finite sequence
of $\lambda$'s,

(2.1) $\lambda_0 > \lambda_1 > \lambda_2 > \cdots > \lambda_K = 0,$

such that:

• For all $\lambda > \lambda_0$, $\hat\beta(\lambda) = 0$.

• In the interior of the interval $(\lambda_{m+1}, \lambda_m)$, the active set
$\mathcal{B}(\lambda)$ and the sign vector
$\mathrm{Sgn}(\lambda)_{\mathcal{B}(\lambda)}$ are constant with respect to
$\lambda$. Thus we write them as $\mathcal{B}_m$ and $\mathrm{Sgn}_m$ for
convenience.

The active set changes at each $\lambda_m$. When $\lambda$ decreases from
$\lambda = \lambda_m - 0$, some predictors with zero coefficients at
$\lambda_m$ are about to have nonzero coefficients; thus they join the active
set $\mathcal{B}_m$. However, as $\lambda$ approaches $\lambda_{m+1} + 0$ there
are possibly some predictors in $\mathcal{B}_m$ whose coefficients reach zero.
Hence we call $\{\lambda_m\}$ the transition points. Any
$\lambda \in [0, \infty) \setminus \{\lambda_m\}$ is called a nontransition
point.

Theorem 1. $\forall \lambda$ the lasso fit $\hat\mu_\lambda(y)$ is a uniformly
Lipschitz function on y. The degrees of freedom of $\hat\mu_\lambda(y)$ equal
the expected size of the effective set $\mathcal{B}_\lambda$, that is,

(2.2) $df(\lambda) = E|\mathcal{B}_\lambda|.$

The identity (2.2) holds as long as X is full rank, that is, rank(X) = p.

Theorem 1 shows that $\widehat{df}(\lambda) = |\mathcal{B}_\lambda|$ is an
unbiased estimate for $df(\lambda)$. Thus $\widehat{df}(\lambda)$ suffices to
provide an exact unbiased estimate to the true prediction risk of the lasso.
The importance of the exact finite-sample unbiasedness is emphasized in
Efron [4], Shen and Ye [20] and Shen and Huang [18]. Our result is also
computationally friendly. Given any data set, the entire solution paths of the
lasso are computed by the LARS algorithm (Efron et al. [5]); then the unbiased
estimator $\widehat{df}(\lambda) = |\mathcal{B}_\lambda|$ is easily obtained
without any extra effort.
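
Computing the estimate amounts to counting nonzero coefficients. The sketch
below illustrates this on simulated data. It swaps in plain cyclic coordinate
descent for the LARS algorithm used in the paper, and the function names
`lasso_cd` and `df_hat` are hypothetical. The objective is the unscaled
criterion, residual sum of squares plus lambda times the l1 norm, so each
coordinate update soft-thresholds at lambda/2.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=5000, tol=1e-12):
    """Cyclic coordinate descent for ||y - X b||^2 + lam * sum_j |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    resid = y.astype(float).copy()             # y - X beta with beta = 0
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            # x_j^T (y - sum_{k != j} x_k beta_k)
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            change = new - beta[j]
            if change != 0.0:
                resid -= X[:, j] * change
                beta[j] = new
                max_change = max(max_change, abs(change))
        if max_change < tol:
            break
    return beta

def df_hat(X, y, lam):
    """Size of the active set, the unbiased df estimate of Theorem 1."""
    return int(np.count_nonzero(lasso_cd(X, y, lam)))

rng = np.random.default_rng(0)
n, p = 30, 6
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0]) + rng.standard_normal(n)
lam_max = 2.0 * np.max(np.abs(X.T @ y))   # all coefficients are zero at or above this
for frac in (1.01, 0.5, 0.1, 0.0):
    print(frac, df_hat(X, y, frac * lam_max))
```

The count is zero above the largest transition point and equals p at
lambda = 0, where the lasso fit coincides with the OLS fit.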

To prove Theorem 1 we shall proceed by proving a series of lemmas whose 
proofs are relegated to Section 4 for the sake of presentation. 

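Lemma 1. Suppose $\lambda \in (\lambda_{m+1}, \lambda_m)$. $\hat\beta(\lambda)$
are the lasso coefficient estimates. Then we have

(2.3) $\hat\beta(\lambda)_{\mathcal{B}_m} = (X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1} \Big( X_{\mathcal{B}_m}^T y - \frac{\lambda}{2} \mathrm{Sgn}_m \Big).$

Lemma 2. Consider the transition points $\lambda_m$ and $\lambda_{m+1}$,
$\lambda_{m+1} > 0$. $\mathcal{B}_m$ is the active set in
$(\lambda_{m+1}, \lambda_m)$. Suppose $i_{add}$ is an index added into
$\mathcal{B}_m$ at $\lambda_m$ and its index in $\mathcal{B}_m$ is $i^*$, that
is, $i_{add} = (\mathcal{B}_m)_{i^*}$. Denote by $(a)_k$ the kth element of the
vector a. We can express the transition point $\lambda_m$ as

(2.4) $\lambda_m = \frac{2\,\big((X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1} X_{\mathcal{B}_m}^T y\big)_{i^*}}{\big((X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1} \mathrm{Sgn}_m\big)_{i^*}}.$

Moreover, if $j_{drop}$ is a dropped (if there is any) index at
$\lambda_{m+1}$ and $j_{drop} = (\mathcal{B}_m)_{j^*}$, then $\lambda_{m+1}$ can
be written as

(2.5) $\lambda_{m+1} = \frac{2\,\big((X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1} X_{\mathcal{B}_m}^T y\big)_{j^*}}{\big((X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1} \mathrm{Sgn}_m\big)_{j^*}}.$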
Lemma 3. $\forall \lambda > 0$, $\exists$ a null set $\mathcal{N}_\lambda$ which
is a finite collection of hyperplanes in $R^n$. Let
$\mathcal{G}_\lambda = R^n \setminus \mathcal{N}_\lambda$. Then
$\forall y \in \mathcal{G}_\lambda$, $\lambda$ is not any of the transition
points, that is, $\lambda \notin \{\lambda(y)_m\}$.

Lemma 4. $\forall \lambda$, $\hat\beta_\lambda(y)$ is a continuous function
of y.

Lemma 5. Fix any $\lambda > 0$ and consider $y \in \mathcal{G}_\lambda$ as
defined in Lemma 3. The active set $\mathcal{B}(\lambda)$ and the sign vector
$\mathrm{Sgn}(\lambda)$ are locally constant with respect to y.

Lemma 6. Let $\mathcal{G}_0 = R^n$. Fix an arbitrary $\lambda \ge 0$. On the
set $\mathcal{G}_\lambda$ with full measure as defined in Lemma 3, the lasso
fit $\hat\mu_\lambda(y)$ is uniformly Lipschitz. Precisely,

(2.6) $\|\hat\mu_\lambda(y + \Delta y) - \hat\mu_\lambda(y)\| \le \|\Delta y\|$ for sufficiently small $\Delta y$.

Moreover, we have the divergence formula

(2.7) $\nabla \cdot \hat\mu_\lambda(y) = |\mathcal{B}_\lambda|.$

Proof of Theorem 1. Theorem 1 is obviously true for $\lambda = 0$. We
only need to consider $\lambda > 0$. By Lemma 6 $\hat\mu_\lambda(y)$ is
uniformly Lipschitz on $\mathcal{G}_\lambda$. Moreover, $\hat\mu_\lambda(y)$ is
a continuous function of y, and thus $\hat\mu_\lambda(y)$ is uniformly
Lipschitz on $R^n$. Hence $\hat\mu_\lambda(y)$ is almost differentiable; see
Meyer and Woodroofe [14] and Efron et al. [5]. Then (2.2) is obtained by
invoking Stein's lemma (Stein [21]) and the divergence formula (2.7). □
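
The final step of this proof can be written out as a short chain of
equalities. This is a sketch that assumes Gaussian errors,
$y \sim N(\mu, \sigma^2 I)$, the setting in which Stein's lemma is usually
stated; $\hat\mu_{\lambda,i}$ denotes the ith component of the lasso fit and
$\mathcal{B}_\lambda$ its active set.

```latex
\begin{align*}
df(\lambda)
  &= \sum_{i=1}^{n} \mathrm{cov}(\hat\mu_{\lambda,i},\, y_i)/\sigma^2
     && \text{definition (1.2)} \\
  &= E\Big[\sum_{i=1}^{n} \frac{\partial \hat\mu_{\lambda,i}}{\partial y_i}\Big]
   = E\big[\nabla \cdot \hat\mu_\lambda(y)\big]
     && \text{Stein's lemma, } \hat\mu_\lambda \text{ almost differentiable} \\
  &= E\,|\mathcal{B}_\lambda|
     && \text{divergence formula (2.7), valid off a null set.}
\end{align*}
```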

2.2. Consistency of the unbiased estimator $\widehat{df}(\lambda)$. In this
section we show that the obtained unbiased estimator $\widehat{df}(\lambda)$ is
also consistent. We adopt a setup similar to that in Knight and Fu [12] for
the asymptotic analysis. Assume the following two conditions:

1. $y_i = x_i \beta^* + \varepsilon_i$, where
$\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. normal random variables with
mean 0 and variance $\sigma^2$, and $\beta^*$ denotes the fixed unknown
regression coefficients.

2. $\frac{1}{n} X^T X \to C$, where C is a positive definite matrix.

We consider minimizing an objective function $Z_\lambda(\beta)$ defined as

(2.8) $Z_\lambda(\beta) = (\beta - \beta^*)^T C (\beta - \beta^*) + \lambda \sum_{j=1}^p |\beta_j|.$

Optimizing (2.8) is a lasso type problem: minimizing a quadratic objective
function with an $\ell_1$ penalty. There is also a finite sequence of
transition points $\{\lambda^*_m\}$ associated with optimizing (2.8).
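Theorem 2. If $\lambda^*_n / n \to \lambda^* > 0$, where $\lambda^*$ is a
nontransition point such that $\lambda^* \ne \lambda^*_m$ for all m, then
$\widehat{df}(\lambda^*_n) - df(\lambda^*_n) \to 0$ in probability.

Proof of Theorem 2. Consider
$\hat\beta^* = \arg\min_\beta Z_{\lambda^*}(\beta)$ and let $\hat\beta^{(n)}$ be
the lasso solution given in (1.1) with $\lambda = \lambda^*_n$. Denote
$\mathcal{B}^{(n)} = \{j : \hat\beta^{(n)}_j \ne 0, 1 \le j \le p\}$ and
$\mathcal{B}^* = \{j : \hat\beta^*_j \ne 0, 1 \le j \le p\}$. We want to show
$P(\mathcal{B}^{(n)} = \mathcal{B}^*) \to 1$. First, let us consider any
$j \in \mathcal{B}^*$. By Theorem 1 in Knight and Fu [12] we know that
$\hat\beta^{(n)} \to_p \hat\beta^*$. Then the continuous mapping theorem
implies that
$\mathrm{Sgn}(\hat\beta^{(n)}_j) \to_p \mathrm{Sgn}(\hat\beta^*_j) \ne 0$,
since Sgn(x) is continuous at all x but zero. Thus
$P(\mathcal{B}^{(n)} \supseteq \mathcal{B}^*) \to 1$. Second, consider any
$j' \notin \mathcal{B}^*$. Then $\hat\beta^*_{j'} = 0$. Since $\hat\beta^*$ is
the minimizer of $Z_{\lambda^*}(\beta)$ and $\lambda^*$ is not a transition
point, by the Karush-Kuhn-Tucker (KKT) optimality condition (Efron et al. [5],
Osborne, Presnell and Turlach [15]), we must have

(2.9) $\lambda^* > 2|C_{j'}(\beta^* - \hat\beta^*)|,$

where $C_{j'}$ is the j'th row vector of C. Let
$r^* = \lambda^* - 2|C_{j'}(\beta^* - \hat\beta^*)| > 0$. Now let us consider
$r_n = \lambda^*_n - 2|x_{j'}^T(y - X\hat\beta^{(n)})|$. Note that

(2.10) $x_{j'}^T(y - X\hat\beta^{(n)}) = x_{j'}^T X(\beta^* - \hat\beta^{(n)}) + x_{j'}^T \varepsilon.$

Thus
$\frac{r_n}{n} = \frac{\lambda^*_n}{n} - 2\big|\frac{1}{n} x_{j'}^T X(\beta^* - \hat\beta^{(n)}) + x_{j'}^T \varepsilon / n\big|$.
Because $\hat\beta^{(n)} \to_p \hat\beta^*$ and
$x_{j'}^T \varepsilon / n \to_p 0$, we conclude $\frac{r_n}{n} \to_p r^* > 0$.
By the KKT optimality condition, $r_n > 0$ implies $\hat\beta^{(n)}_{j'} = 0$.
Thus $P(\mathcal{B}^* \supseteq \mathcal{B}^{(n)}) \to 1$. Therefore
$P(\mathcal{B}^{(n)} = \mathcal{B}^*) \to 1$.

Immediately we see $\widehat{df}(\lambda^*_n) \to_p |\mathcal{B}^*|$. Then
invoking the dominated convergence theorem we have

(2.11) $df(\lambda^*_n) = E[\widehat{df}(\lambda^*_n)] \to |\mathcal{B}^*|.$

So $\widehat{df}(\lambda^*_n) - df(\lambda^*_n) \to_p 0$. □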

2.3. Numerical experiments. In this section we check the validity of our 
arguments by a simulation study. Here is the outline of the simulation. We 
take the 64 predictors in the diabetes data set, which include the quadratic 
terms and interactions of the original ten predictors. The positive cone con- 
dition is violated on the 64 predictors (Efron et al. [5]). The response vector 
y is used to fit an OLS model. We compute the OLS estimates
$\hat\beta_{ols}$ and $\hat\sigma^2_{ols}$. Then we consider a synthetic model,

(2.12) $y^* = X\beta + N(0,1)\sigma,$

where $\beta = \hat\beta_{ols}$ and $\sigma = \hat\sigma_{ols}$.
Given the synthetic model, the degrees of freedom of the lasso can be
numerically evaluated by Monte Carlo methods. For b = 1, 2, ..., B, we
independently simulate $y^*(b)$ from (2.12). For a given $\lambda$, by the
definition of df, we need to evaluate
$\mathrm{cov}_i = \mathrm{cov}(\hat\mu_i, y^*_i)$. Then
$df = \sum_{i=1}^n \mathrm{cov}_i / \sigma^2$. Since
$E[y^*_i] = (X\beta)_i$, note that
$\mathrm{cov}_i = E[(\hat\mu_i - a_i)(y^*_i - (X\beta)_i)]$ for any fixed known
constant $a_i$. Then we compute

(2.13) $\widehat{\mathrm{cov}}_i = \frac{\sum_{b=1}^B (\hat\mu_i(b) - a_i)(y^*_i(b) - (X\beta)_i)}{B}$

and $\widehat{df} = \sum_{i=1}^n \widehat{\mathrm{cov}}_i / \sigma^2$.
Typically $a_i = 0$ is used in Monte Carlo calculation. In this work we use
$a_i = (X\beta)_i$, for it gives a Monte Carlo estimate for df with smaller
variance than that given by $a_i = 0$. On the other hand, we evaluate
$E|\mathcal{B}_\lambda|$ by $\sum_{b=1}^B \widehat{df}(\lambda)_b / B$. We are
interested in $E|\mathcal{B}_\lambda| - df(\lambda)$.

Standard errors are calculated based on the B replications. Figure 3 shows
very convincing pictures to support the identity (2.2).
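
The same Monte Carlo check can be sketched on a much smaller problem. The
example below does not use the diabetes data: it assumes a hypothetical
orthonormal design (X^T X = I) with made-up coefficients, so that the lasso
solution has the closed form of soft-thresholding X^T y at lambda/2. It
compares the covariance-based estimate of df, centered at a_i = (X beta)_i,
with the average number of nonzero coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, lam, n_rep = 10, 5, 1.0, 2.0, 100_000

# Hypothetical orthonormal design so that the lasso has a closed form.
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
beta = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
mean = X @ beta                                   # E[y*] = X beta

# Simulate all replications y*(b) at once: one row per replication.
Y = mean + sigma * rng.standard_normal((n_rep, n))

# With X^T X = I, minimizing ||y - X b||^2 + lam * |b|_1 soft-thresholds
# z = X^T y at lam / 2.
Z = Y @ X
beta_hat = np.sign(Z) * np.maximum(np.abs(Z) - lam / 2.0, 0.0)
mu_hat = beta_hat @ X.T

# Covariance estimate with a_i = (X beta)_i, then df = sum_i cov_i / sigma^2.
cov_hat = np.mean((mu_hat - mean) * (Y - mean), axis=0)
df_mc = float(np.sum(cov_hat) / sigma ** 2)

# Monte Carlo estimate of E|B_lambda|: average number of nonzero coefficients.
mean_active = float(np.mean(np.count_nonzero(beta_hat, axis=1)))
print(df_mc, mean_active)
```

The two printed numbers should agree up to Monte Carlo error, which is what
the identity (2.2) predicts.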

2.4. Adaptive model selection criteria. The exact value of $df(\lambda)$
depends on the underlying model according to Theorem 1. It remains unknown to
us unless we know the underlying model. Our theory provides a convenient
unbiased and consistent estimate of the unknown $df(\lambda)$. In the spirit
of SURE theory, the good unbiased estimate for $df(\lambda)$ suffices to
provide an unbiased estimate for the prediction error of $\hat\mu_\lambda$:

(2.14) $C_p(\hat\mu) = \frac{\|y - \hat\mu\|^2}{n} + \frac{2}{n}\,\widehat{df}(\hat\mu)\,\sigma^2.$

Consider the Cp curve as a function of the regularization parameter
$\lambda$. We find the optimal $\lambda$ that minimizes Cp. As shown in Shen
and Ye [20], this model selection approach leads to an adaptively optimal
model which essentially achieves the optimal prediction risk as if the ideal
tuning parameter were given in advance.

By the connection between Mallows' Cp (Mallows [13]) and AIC (Akaike [1]),
we use the (generalized) Cp formula (2.14) to equivalently define AIC for
the lasso,

(2.15) $\mathrm{AIC}(\hat\mu) = \frac{\|y - \hat\mu\|^2}{n\sigma^2} + \frac{2}{n}\,\widehat{df}(\hat\mu).$

The model selection results are identical by Cp and AIC. Following the usual
definition of BIC [16], we propose BIC for the lasso as

(2.16) $\mathrm{BIC}(\hat\mu) = \frac{\|y - \hat\mu\|^2}{n\sigma^2} + \frac{\log(n)}{n}\,\widehat{df}(\hat\mu).$

AIC and BIC possess different asymptotic optimality. It is well known that
AIC tends to select the model with the optimal prediction performance,
while BIC tends to identify the true sparse model if the true model is in the
candidate list; see Shao [17], Yang [23] and references therein. We suggest
using BIC as the model selection criterion when the sparsity of the model
is our primary concern.

Fig. 3. The synthetic model with the 64 predictors in the diabetes data. In
the top panel we compare $E|\mathcal{B}_\lambda|$ with the true degrees of
freedom $df(\lambda)$ based on B = 20000 Monte Carlo simulations. The solid
line is the 45° line (the perfect match). The bottom panel shows the
estimation bias and its point-wise 95% confidence intervals are indicated by
the thin dashed lines. Note that the zero horizontal line is well inside the
confidence intervals.

DEGREES OF FREEDOM OF THE LASSO


Using either AIC or BIC to find the optimal lasso model, we are facing
an optimization problem,

(2.17) $\lambda(\mathrm{optimal}) = \arg\min_\lambda \frac{\|y - \hat\mu_\lambda\|^2}{n\sigma^2} + \frac{w_n}{n}\,\widehat{df}(\lambda),$

where $w_n = 2$ for AIC and $w_n = \log(n)$ for BIC. Since the LARS algorithm
efficiently solves the lasso solution for all $\lambda$, finding
$\lambda(\mathrm{optimal})$ is attainable in principle. In fact, we show that
$\lambda(\mathrm{optimal})$ is one of the transition points, which further
facilitates the searching procedure.

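Theorem 3. To find $\lambda(\mathrm{optimal})$, we only need to solve

(2.18) $m^* = \arg\min_m \frac{\|y - \hat\mu_{\lambda_m}\|^2}{n\sigma^2} + \frac{w_n}{n}\,\widehat{df}(\lambda_m);$

then $\lambda(\mathrm{optimal}) = \lambda_{m^*}$.

Proof. Let us consider $\lambda \in (\lambda_{m+1}, \lambda_m)$. By (2.3) we
have

(2.19) $\|y - \hat\mu_\lambda\|^2 = y^T(I - H_{\mathcal{B}_m})y + \frac{\lambda^2}{4}\,\mathrm{Sgn}_m^T (X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1}\,\mathrm{Sgn}_m,$

where
$H_{\mathcal{B}_m} = X_{\mathcal{B}_m}(X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1} X_{\mathcal{B}_m}^T$.
Thus we can conclude that $\|y - \hat\mu_\lambda\|^2$ is strictly increasing in
the interval $(\lambda_{m+1}, \lambda_m)$. Moreover, the lasso estimates are
continuous on $\lambda$, hence
$\|y - \hat\mu_{\lambda_m}\|^2 > \|y - \hat\mu_\lambda\|^2 > \|y - \hat\mu_{\lambda_{m+1}}\|^2$.
On the other hand, note that $\widehat{df}(\lambda) = |\mathcal{B}_m|$
$\forall \lambda \in (\lambda_{m+1}, \lambda_m)$ and
$|\mathcal{B}_m| \ge |\mathcal{B}(\lambda_{m+1})|$. Therefore the optimal choice
of $\lambda$ in $[\lambda_{m+1}, \lambda_m)$ is $\lambda_{m+1}$, which means
$\lambda(\mathrm{optimal}) \in \{\lambda_m\}$. □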

According to Theorem 3, the optimal lasso model is immediately selected 
once we compute the entire lasso solution paths by the LARS algorithm. We 
can finish the whole fitting and tuning process with the computational cost 
of a single least squares fit. 
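
The search over transition points can be sketched as follows. The example
again assumes an orthonormal design (our simplification, not the paper's
setting), for which the transition points are 2|x_j^T y| in decreasing order
together with 0, and the fit at each one is available in closed form. The
function name `select_transition_point` is hypothetical, and sigma^2 is
treated as known.

```python
import math
import numpy as np

def select_transition_point(rss, df_hat, n, sigma2, w_n):
    """Index m* minimizing rss[m] / (n * sigma2) + (w_n / n) * df_hat[m]."""
    crit = (np.asarray(rss, dtype=float) / (n * sigma2)
            + (w_n / n) * np.asarray(df_hat, dtype=float))
    return int(np.argmin(crit))

rng = np.random.default_rng(4)
n, p, sigma2 = 20, 5, 1.0
X, _ = np.linalg.qr(rng.standard_normal((n, p)))      # assumed orthonormal design
y = X @ np.array([4.0, -3.0, 0.5, 0.0, 0.0]) + rng.standard_normal(n)

z = X.T @ y
# With X^T X = I the transition points are 2|z_j| (decreasing) and 0.
lambdas = np.append(np.sort(2.0 * np.abs(z))[::-1], 0.0)
rss, df_hat = [], []
for lam in lambdas:
    beta = np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)
    rss.append(float(np.sum((y - X @ beta) ** 2)))
    df_hat.append(int(np.count_nonzero(beta)))

m_aic = select_transition_point(rss, df_hat, n, sigma2, 2.0)          # w_n = 2
m_bic = select_transition_point(rss, df_hat, n, sigma2, math.log(n))  # w_n = log(n)
print(lambdas[m_aic], df_hat[m_aic], lambdas[m_bic], df_hat[m_bic])
```

Because log(n) exceeds 2 for n = 20, the BIC choice never has more nonzero
coefficients than the AIC choice in this sketch.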

3. Efron's conjecture. Efron et al. [5] first considered deriving the ana- 
lytical form of the degrees of freedom of the lasso. They proposed a stage- 
wise algorithm called LARS to compute the entire lasso solution paths. They 
also presented the following conjecture on the degrees of freedom of the lasso: 

Conjecture 1. Starting at step 0, let $m_k^{last}$ be the index of the last
LARS-lasso sequence containing exactly k nonzero predictors. Then
$df(\hat\mu_{m_k^{last}}) = k$.
Note that Efron et al. [5] viewed the lasso as a forward stage-wise modeling
algorithm and used the number of steps as the tuning parameter in the
lasso: the lasso is regularized by early stopping. In the previous sections
we regarded the lasso as a continuous penalization method with $\lambda$ as its
regularization parameter. There is a subtle but important difference between
the two views. The $\lambda$ value associated with $m_k^{last}$ is a random
quantity. In the forward stage-wise modeling view of the lasso, the conjecture
cannot be used for the degrees of freedom of the lasso at a general step k
for a prefixed k. This is simply because the number of LARS-lasso steps can
exceed the number of all predictors (Efron et al. [5]). In contrast, the
unbiasedness property of $\widehat{df}(\lambda)$ holds for all $\lambda$.

In this section we provide some justifications for the conjecture: 

• We give a much more simplified proof than that in Efron et al. [5] to show 
that the conjecture is true under the positive cone condition. 

• Our analysis also indicates that without the positive cone condition the
conjecture can be wrong, although k is a good approximation of
$df(\hat\mu_{m_k^{last}})$.

• We show that the conjecture works appropriately from the model selection
perspective. If we use the conjecture to construct AIC (or BIC) to select
the lasso fit, then the selected model is identical to that selected by AIC
(or BIC) using the exact degrees of freedom results in Section 2.4.

First, we need to show that with probability one we can well define the
last LARS-lasso sequence containing exactly k nonzero predictors. Since
the conjecture becomes a simple fact for the two trivial cases k = 0 and
k = p, we only need to consider k = 1, ..., p - 1. Let
$\Lambda_k = \{m : |\mathcal{B}_{\lambda_m}| = k\}$,
$k \in \{1, 2, \ldots, (p - 1)\}$. Then $m_k^{last} = \sup(\Lambda_k)$.
However, it may happen that for some k there is no such m with
$|\mathcal{B}_{\lambda_m}| = k$. For example, if y is an equiangular vector of
all $\{x_j\}$, then the lasso estimates become the OLS estimates after just
one step. So $\Lambda_k = \emptyset$ for k = 2, ..., p - 1. The next lemma
shows that the "one at a time" condition (Efron et al. [5]) holds almost
everywhere; therefore $m_k^{last}$ is well defined almost surely.

Lemma 7. Let $\mathcal{W}_m(y)$ denote the set of predictors that are to be
included in the active set at $\lambda_m$ and let $\mathcal{V}_m(y)$ be the set
of predictors that are deleted from the active set at $\lambda_{m+1}$. Then
$\exists$ a set $\mathcal{N}_0$ which is a collection of finitely many
hyperplanes in $R^n$. $\forall y \in R^n \setminus \mathcal{N}_0$,

(3.1) $|\mathcal{W}_m(y)| \le 1$ and $|\mathcal{V}_m(y)| \le 1 \quad \forall m = 0, 1, \ldots, K(y).$

$y \in R^n \setminus \mathcal{N}_0$ is said to be a locally stable point for
$\Lambda_k$, if $\forall y'$ such that $\|y' - y\| \le \varepsilon(y)$ for a
small enough $\varepsilon(y)$, the effective set
$\mathcal{B}(\lambda_{m_k^{last}})(y') = \mathcal{B}(\lambda_{m_k^{last}})(y)$.
Let LS(k) be the set of all locally stable points.
The next lemma helps us evaluate $df(\hat\mu_{m_k^{last}})$.
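Lemma 8. Let $\hat\mu_m(y)$ be the lasso fit at the transition point
$\lambda_m$, $\lambda_m > 0$. Then for any $i \in \mathcal{W}_m$, we can write
$\hat\mu_m(y)$ as

(3.2) $\hat\mu_m(y) = \Big\{ H_{\mathcal{B}(\lambda_m)} - \frac{X_{\mathcal{B}(\lambda_m)}(X_{\mathcal{B}(\lambda_m)}^T X_{\mathcal{B}(\lambda_m)})^{-1}\,\mathrm{Sgn}(\lambda_m)\, x_i^T (I - H_{\mathcal{B}(\lambda_m)})}{\mathrm{Sgn}_i - x_i^T X_{\mathcal{B}(\lambda_m)}(X_{\mathcal{B}(\lambda_m)}^T X_{\mathcal{B}(\lambda_m)})^{-1}\,\mathrm{Sgn}(\lambda_m)} \Big\}\, y$

(3.3) $=: S_m(y)\, y,$

where $H_{\mathcal{B}(\lambda_m)}$ is the projection matrix on the subspace of
$X_{\mathcal{B}(\lambda_m)}$. Moreover

(3.4) $\mathrm{tr}(S_m(y)) = |\mathcal{B}(\lambda_m)|.$

Note that $|\mathcal{B}(\lambda_{m_k^{last}})| = k$. Therefore, if
$y \in LS(k)$, then

(3.5) $\nabla \cdot \hat\mu_{m_k^{last}}(y) = \mathrm{tr}(S_{m_k^{last}}(y)) = k.$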

If the positive cone condition holds then the lasso solution paths are
monotone (Efron et al. [5]), hence Lemma 7 implies that LS(k) is a set of
full measure. Then by Lemma 8 we know that $df(m_k^{last}) = k$. However, it
should be pointed out that $k - df(m_k^{last})$ can be nonzero for some k when
the positive cone condition is violated. Here we present an explicit example
to show this point. We consider the synthetic model in Section 2.3. Note
that the positive cone condition is violated on the 64 predictors [5]. As
done in Section 2.3, the exact value of $df(m_k^{last})$ can be computed by
Monte Carlo and then we evaluate the bias $k - df(m_k^{last})$. In the
synthetic model (2.12) the signal/noise ratio
$\mathrm{Var}(X\hat\beta_{ols})/\hat\sigma^2_{ols}$ is about 1.25. We repeated
the same simulation procedure with
$(\beta = \hat\beta_{ols}, \sigma = \hat\sigma_{ols}/10)$ in the synthetic model
and the corresponding signal/noise ratio became 125. As shown clearly in
Figure 4, the bias $k - df(m_k^{last})$ is not zero for some k. However, even
if the bias exists, its maximum magnitude is less than one, regardless of the
size of the signal/noise ratio, which suggests that k is a good estimate of
$df(m_k^{last})$.

Let us pretend the conjecture is true in all situations and then define the
model selection criteria as

(3.6) $\frac{\|y - \hat\mu_{m_k^{last}}\|^2}{n\sigma^2} + \frac{w_n}{n}\,k.$

$w_n = 2$ for AIC and $w_n = \log(n)$ for BIC. Treat k as the tuning parameter
of the lasso. We need to find k(optimal) such that

(3.7) $k(\mathrm{optimal}) = \arg\min_k \frac{\|y - \hat\mu_{m_k^{last}}\|^2}{n\sigma^2} + \frac{w_n}{n}\,k.$
Fig. 4. B = 20000 replications were used to assess the bias of
$\widehat{df}(m_k^{last}) = k$. The 95% point-wise confidence intervals are
indicated by the thin dashed lines. This simulation suggests that when the
positive cone condition is violated, $df(m_k^{last}) \ne k$ for some k.
However, the bias is small (the maximum absolute bias is about 0.8),
regardless of the size of the signal/noise ratio.



Suppose $\lambda^* = \lambda(\mathrm{optimal})$ and
$k^* = k(\mathrm{optimal})$. Theorem 3 implies that the models selected by
(2.17) and (3.7) coincide, that is,
$\hat\mu_{\lambda^*} = \hat\mu_{m_{k^*}^{last}}$. This observation suggests
that although the conjecture is not always true, it actually works
appropriately for the purpose of model selection.

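4. Proofs of the lemmas. First, let us introduce the following matrix
representation of the divergence. Let $\frac{\partial\hat\mu}{\partial y}$ be
an n × n matrix whose elements are

(4.1) $\Big(\frac{\partial\hat\mu}{\partial y}\Big)_{i,j} = \frac{\partial\hat\mu_i}{\partial y_j}, \quad i, j = 1, 2, \ldots, n.$

Then we can write

(4.2) $\nabla \cdot \hat\mu = \mathrm{tr}\Big(\frac{\partial\hat\mu}{\partial y}\Big).$

The above trace expression will be used repeatedly.
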
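Proof of Lemma 1. Let

(4.3) $\ell(\beta, y) = \Big\| y - \sum_{j=1}^p x_j \beta_j \Big\|^2 + \lambda \sum_{j=1}^p |\beta_j|.$

Given y, $\hat\beta(\lambda)$ is the minimizer of $\ell(\beta, y)$. For those
$j \in \mathcal{B}_m$ we must have
$\frac{\partial \ell(\beta, y)}{\partial \beta_j} = 0$, that is,

(4.4) $-2x_j^T\Big(y - \sum_{j=1}^p x_j \hat\beta(\lambda)_j\Big) + \lambda\,\mathrm{Sgn}(\hat\beta(\lambda)_j) = 0,$ for $j \in \mathcal{B}_m$.

Since $\hat\beta(\lambda)_i = 0$ for all $i \notin \mathcal{B}_m$, then
$\sum_{j=1}^p x_j \hat\beta(\lambda)_j = \sum_{j \in \mathcal{B}_m} x_j \hat\beta(\lambda)_j$.
Thus the equations in (4.4) become

(4.5) $-2X_{\mathcal{B}_m}^T\big(y - X_{\mathcal{B}_m}\hat\beta(\lambda)_{\mathcal{B}_m}\big) + \lambda\,\mathrm{Sgn}_m = 0,$

which gives (2.3). □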
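
The closed form in Lemma 1 can be checked numerically: fit the lasso by an
independent method, read off the active set and signs, and compare with the
formula "inverse Gram matrix of the active columns times (active columns
transposed times y, minus lambda/2 times the sign vector)". The sketch below
uses a plain coordinate descent solver of our own (`lasso_cd`, a hypothetical
helper, not the LARS algorithm) on simulated data.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=5000, tol=1e-12):
    """Cyclic coordinate descent for ||y - X b||^2 + lam * sum_j |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    resid = y.astype(float).copy()
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            change = new - beta[j]
            if change != 0.0:
                resid -= X[:, j] * change
                beta[j] = new
                max_change = max(max_change, abs(change))
        if max_change < tol:
            break
    return beta

rng = np.random.default_rng(2)
n, p = 40, 6
X = rng.standard_normal((n, p))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0]) + rng.standard_normal(n)
lam = 0.3 * 2.0 * np.max(np.abs(X.T @ y))      # below the largest transition point

beta_cd = lasso_cd(X, y, lam)
active = np.flatnonzero(beta_cd)               # active set B
signs = np.sign(beta_cd[active])               # sign vector on B
XB = X[:, active]
# Closed form of Lemma 1 restricted to the active set.
beta_closed = np.linalg.solve(XB.T @ XB, XB.T @ y - lam / 2.0 * signs)
print(np.max(np.abs(beta_cd[active] - beta_closed)))
```

The printed discrepancy should be at the level of the solver's tolerance.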

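Proof of Lemma 2. We adopt the matrix notation used in S-PLUS:
M[i, ·] means the ith row of M. $i_{add}$ joins $\mathcal{B}_m$ at
$\lambda_m$; then $\hat\beta(\lambda_m)_{i_{add}} = 0$. Consider
$\hat\beta(\lambda)$ for $\lambda \in (\lambda_{m+1}, \lambda_m)$. Lemma 1 gives

(4.6) $\hat\beta(\lambda)_{\mathcal{B}_m} = (X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1}\Big(X_{\mathcal{B}_m}^T y - \frac{\lambda}{2}\,\mathrm{Sgn}_m\Big).$

By the continuity of $\hat\beta(\lambda)_{i_{add}}$, taking the limit of the
$i^*$th element of (4.6) as $\lambda \to \lambda_m - 0$, we have

(4.7) $2\{(X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1}[i^*, \cdot]\, X_{\mathcal{B}_m}^T\}\, y = \lambda_m \{(X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1}[i^*, \cdot]\,\mathrm{Sgn}_m\}.$

The second {·} is a nonzero scalar, otherwise
$\hat\beta(\lambda)_{i_{add}} = 0$ for all
$\lambda \in (\lambda_{m+1}, \lambda_m)$, which contradicts the assumption that
$i_{add}$ becomes a member of the active set $\mathcal{B}_m$. Thus we have

(4.8) $\lambda_m = \Big\{\frac{2\,(X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1}[i^*, \cdot]}{(X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1}[i^*, \cdot]\,\mathrm{Sgn}_m}\Big\}\, X_{\mathcal{B}_m}^T y =: v(\mathcal{B}_m, i^*)\, X_{\mathcal{B}_m}^T y,$

where
$v(\mathcal{B}_m, i^*) = \{2((X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1}[i^*, \cdot]) / ((X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1}[i^*, \cdot]\,\mathrm{Sgn}_m)\}$.
Rearranging (4.8), we get (2.4).

Similarly, if $j_{drop}$ is a dropped index at $\lambda_{m+1}$, we take the
limit of the $j^*$th element of (4.6) as $\lambda \to \lambda_{m+1} + 0$ to
conclude that

(4.9) $\lambda_{m+1} = v(\mathcal{B}_m, j^*)\, X_{\mathcal{B}_m}^T y,$

where
$v(\mathcal{B}_m, j^*) = \{2((X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1}[j^*, \cdot]) / ((X_{\mathcal{B}_m}^T X_{\mathcal{B}_m})^{-1}[j^*, \cdot]\,\mathrm{Sgn}_m)\}$.
Rearranging (4.9), we get (2.5). □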

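Proof of Lemma 3. Suppose for some y and m, $\lambda = \lambda(y)_m$.
$\lambda > 0$ means m is not the last lasso step. By Lemma 2 we have

(4.10) $\lambda = \lambda_m = \{v(\mathcal{B}_m, i^*)\, X_{\mathcal{B}_m}^T\}\, y =: \alpha(\mathcal{B}_m, i^*)\, y.$

Obviously
$\alpha(\mathcal{B}_m, i^*) = v(\mathcal{B}_m, i^*)\, X_{\mathcal{B}_m}^T$ is a
nonzero vector. Now let $\alpha_\lambda$ be the totality of
$\alpha(\mathcal{B}_m, i^*)$ by considering all the possible combinations of
$\mathcal{B}_m$, $i^*$ and the sign vector $\mathrm{Sgn}_m$. $\alpha_\lambda$
depends only on X and is a finite set, since at most p predictors are
available. Thus $\forall \alpha \in \alpha_\lambda$, $\alpha y = \lambda$
defines a hyperplane in $R^n$. We define

$\mathcal{N}_\lambda = \{y : \alpha y = \lambda$ for some $\alpha \in \alpha_\lambda\}$ and $\mathcal{G}_\lambda = R^n \setminus \mathcal{N}_\lambda$.

Then on $\mathcal{G}_\lambda$ (4.10) is impossible. □

H. ZOU, T. HASTIE AND R. TIBSHIRANI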

Proof of Lemma 4. For writing convenience we omit the subscript
$\lambda$. Let $\hat\beta(y)_{ols} = (X^T X)^{-1} X^T y$ be the OLS estimates.
Note that we always have the inequality

(4.11) $|\hat\beta(y)|_1 \le |\hat\beta(y)_{ols}|_1.$

Fix an arbitrary $y_0$ and consider a sequence of $\{y_n\}$ (n = 1, 2, ...)
such that $y_n \to y_0$. Since $y_n \to y_0$, we can find a Y such that
$\|y_n\| \le Y$ for all n = 0, 1, 2, .... Consequently
$\|\hat\beta(y_n)_{ols}\| \le B$ for some upper bound B (B is determined by X
and Y). By Cauchy's inequality and (4.11), we have
$|\hat\beta(y_n)|_1 \le \sqrt{p}\,B$ for all n = 0, 1, 2, .... Thus to show
$\hat\beta(y_n) \to \hat\beta(y_0)$, it is equivalent to show that for every
converging subsequence of $\{\hat\beta(y_n)\}$, say
$\{\hat\beta(y_{n_k})\}$, the subsequence converges to $\hat\beta(y_0)$. Now
suppose $\hat\beta(y_{n_k})$ converges to $\hat\beta_\infty$ as
$n_k \to \infty$. We show $\hat\beta_\infty = \hat\beta(y_0)$. The lasso
criterion $\ell(\beta, y)$ is written in (4.3). Let
$\Delta\ell(\beta, y, y') = \ell(\beta, y) - \ell(\beta, y')$. By the
definition of $\hat\beta(y_{n_k})$, we must have

(4.12) $\ell(\hat\beta(y_0), y_{n_k}) \ge \ell(\hat\beta(y_{n_k}), y_{n_k}).$

Then (4.12) gives

(4.13) $\ell(\hat\beta(y_0), y_0) = \ell(\hat\beta(y_0), y_{n_k}) + \Delta\ell(\hat\beta(y_0), y_0, y_{n_k})$
       $\ge \ell(\hat\beta(y_{n_k}), y_{n_k}) + \Delta\ell(\hat\beta(y_0), y_0, y_{n_k})$
       $= \ell(\hat\beta(y_{n_k}), y_0) + \Delta\ell(\hat\beta(y_{n_k}), y_{n_k}, y_0) + \Delta\ell(\hat\beta(y_0), y_0, y_{n_k}).$

We observe

(4.14) $\Delta\ell(\hat\beta(y_{n_k}), y_{n_k}, y_0) + \Delta\ell(\hat\beta(y_0), y_0, y_{n_k}) = 2(y_0 - y_{n_k})^T X(\hat\beta(y_{n_k}) - \hat\beta(y_0)).$

Let $n_k \to \infty$; the right-hand side of (4.14) goes to zero. Moreover,
$\ell(\hat\beta(y_{n_k}), y_0) \to \ell(\hat\beta_\infty, y_0)$. Therefore
(4.13) reduces to

$\ell(\hat\beta(y_0), y_0) \ge \ell(\hat\beta_\infty, y_0).$

However, $\hat\beta(y_0)$ is the unique minimizer of $\ell(\beta, y_0)$, and
thus $\hat\beta_\infty = \hat\beta(y_0)$. □

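Proof of Lemma 5. Fix an arbitrary $y_0 \in \mathcal{G}_\lambda$. Denote by
Ball(y, r) the n-dimensional ball with center y and radius r. Note that
$\mathcal{G}_\lambda$ is an open set, so we can choose a small enough
$\varepsilon$ such that
$\mathrm{Ball}(y_0, \varepsilon) \subset \mathcal{G}_\lambda$. Fix
$\varepsilon$. Suppose $y_n \to y_0$ as $n \to \infty$. Then without loss of
generality we can assume $y_n \in \mathrm{Ball}(y_0, \varepsilon)$ for all n.
So $\lambda$ is not a transition point for any $y_n$.

By definition $\hat\beta(y_0)_j \ne 0$ for all $j \in \mathcal{B}(y_0)$. Then
Lemma 4 says that $\exists$ an $N_1$, and as long as $n > N_1$, we have
$\hat\beta(y_n)_j \ne 0$ and
$\mathrm{Sgn}(\hat\beta(y_n)_j) = \mathrm{Sgn}(\hat\beta(y_0)_j)$, for all
$j \in \mathcal{B}(y_0)$. Thus
$\mathcal{B}(y_0) \subseteq \mathcal{B}(y_n)$ $\forall n > N_1$.

On the other hand, we have the equiangular conditions (Efron et al. [5])

(4.15) $\lambda = 2|x_j^T(y_0 - X\hat\beta(y_0))| \quad \forall j \in \mathcal{B}(y_0),$

(4.16) $\lambda > 2|x_j^T(y_0 - X\hat\beta(y_0))| \quad \forall j \notin \mathcal{B}(y_0).$

Using Lemma 4 again, we conclude that $\exists$ an $N > N_1$ such that
$\forall j \notin \mathcal{B}(y_0)$ the strict inequalities (4.16) hold for
$y_n$ provided $n > N$. Thus
$\mathcal{B}^c(y_0) \subseteq \mathcal{B}^c(y_n)$ $\forall n > N$. Therefore we
have $\mathcal{B}(y_n) = \mathcal{B}(y_0)$ $\forall n > N$. Then the local
constancy of the sign vector follows the continuity of $\hat\beta(y)$. □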

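Proof of Lemma 6. If $\lambda = 0$, then the lasso fit is just the OLS fit.
The conclusions are easy to verify. So we focus on $\lambda > 0$. Fix a y.
Choose a small enough $\varepsilon$ such that
$\mathrm{Ball}(y, \varepsilon) \subset \mathcal{G}_\lambda$.

Since $\lambda$ is not any transition point, using (2.3) we observe

(4.17) $\hat\mu_\lambda(y) = X\hat\beta(y) = H_\lambda(y)\,y - \lambda\,\omega_\lambda(y),$

where
$H_\lambda(y) = X_{\mathcal{B}_\lambda}(X_{\mathcal{B}_\lambda}^T X_{\mathcal{B}_\lambda})^{-1} X_{\mathcal{B}_\lambda}^T$
is the projection matrix on the space $X_{\mathcal{B}_\lambda}$ and
$\omega_\lambda(y) = \frac{1}{2} X_{\mathcal{B}_\lambda}(X_{\mathcal{B}_\lambda}^T X_{\mathcal{B}_\lambda})^{-1}\,\mathrm{Sgn}_{\mathcal{B}_\lambda}$.
Consider $\|\Delta y\| < \varepsilon$. Similarly, we get

(4.18) $\hat\mu_\lambda(y + \Delta y) = H_\lambda(y + \Delta y)(y + \Delta y) - \lambda\,\omega_\lambda(y + \Delta y).$

Lemma 5 says that we can further let $\varepsilon$ be sufficiently small such
that both the effective set $\mathcal{B}_\lambda$ and the sign vector
$\mathrm{Sgn}_\lambda$ stay constant in $\mathrm{Ball}(y, \varepsilon)$.
Now fix $\varepsilon$. Hence if $\|\Delta y\| < \varepsilon$, then

(4.19) $H_\lambda(y + \Delta y) = H_\lambda(y)$ and $\omega_\lambda(y + \Delta y) = \omega_\lambda(y).$

Then (4.17) and (4.18) give

(4.20) $\hat\mu_\lambda(y + \Delta y) - \hat\mu_\lambda(y) = H_\lambda(y)\,\Delta y.$

But since $\|H_\lambda(y)\,\Delta y\| \le \|\Delta y\|$, (2.6) is proved.
By the local constancy of H(y) and $\omega(y)$, we have

(4.21) $\frac{\partial \hat\mu_\lambda(y)}{\partial y} = H_\lambda(y).$

Then the trace formula (4.2) implies that

(4.22) $\nabla \cdot \hat\mu_\lambda(y) = \mathrm{tr}(H_\lambda(y)) = |\mathcal{B}_\lambda|.$ □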
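
The two linear-algebra facts used in this proof, that a projection matrix
never lengthens a vector and that its trace equals the number of active
columns, are easy to confirm numerically. The sketch below uses a random
matrix as a stand-in for the active columns; the dimensions are arbitrary
choices of ours.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 12, 4
XB = rng.standard_normal((n, k))             # stand-in for the active columns
H = XB @ np.linalg.solve(XB.T @ XB, XB.T)    # projection onto their column space

delta = rng.standard_normal(n)               # an arbitrary perturbation of y
print(np.trace(H))                           # close to k, the size of the active set
print(np.linalg.norm(H @ delta), np.linalg.norm(delta))
```

The first printed value is k up to rounding, and the projected perturbation is
never longer than the perturbation itself.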

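Proof of Lemma 7. Suppose at step $m$, $|\mathcal{W}_m(y)| \ge 2$. Let $i_{\mathrm{add}}$ and $j_{\mathrm{add}}$
be two of the predictors in $\mathcal{W}_m(y)$, and let $i^*_{\mathrm{add}}$ and $j^*_{\mathrm{add}}$ be their indices in
the current active set $\mathcal{A}$. Note the current active set $\mathcal{A}$ is $\mathcal{B}_m$ in Lemma 2.
Hence we have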

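(4.23) $\lambda_m = v(\mathcal{A}, i^*_{\mathrm{add}})\,X_{\mathcal{A}}^T y$ and $\lambda_m = v(\mathcal{A}, j^*_{\mathrm{add}})\,X_{\mathcal{A}}^T y.$

Therefore

(4.24) $0 = \{[v(\mathcal{A}, i^*_{\mathrm{add}}) - v(\mathcal{A}, j^*_{\mathrm{add}})]\,X_{\mathcal{A}}^T\}\,y =: \alpha_{\mathrm{add}}\,y.$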

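We claim $\alpha_{\mathrm{add}} = [v(\mathcal{A}, i^*_{\mathrm{add}}) - v(\mathcal{A}, j^*_{\mathrm{add}})]\,X_{\mathcal{A}}^T$ is not a zero vector. Otherwise,
since $\{x_j\}$ are linearly independent, $\alpha_{\mathrm{add}} = 0$ forces $v(\mathcal{A}, i^*_{\mathrm{add}}) - v(\mathcal{A}, j^*_{\mathrm{add}}) = 0$.
Then we have

(4.25) $\dfrac{(X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1}[i^*_{\mathrm{add}}, \cdot]}{(X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1}[i^*_{\mathrm{add}}, \cdot]\,\mathrm{Sgn}_{\mathcal{A}}} = \dfrac{(X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1}[j^*_{\mathrm{add}}, \cdot]}{(X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1}[j^*_{\mathrm{add}}, \cdot]\,\mathrm{Sgn}_{\mathcal{A}}},$

which contradicts the fact that $(X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1}$ is a full rank matrix.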
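Similarly, if $i_{\mathrm{drop}}$ and $j_{\mathrm{drop}}$ are dropped predictors, then

(4.26) $0 = \{[v(\mathcal{A}, i^*_{\mathrm{drop}}) - v(\mathcal{A}, j^*_{\mathrm{drop}})]\,X_{\mathcal{A}}^T\}\,y =: \alpha_{\mathrm{drop}}\,y,$

and $\alpha_{\mathrm{drop}} = [v(\mathcal{A}, i^*_{\mathrm{drop}}) - v(\mathcal{A}, j^*_{\mathrm{drop}})]\,X_{\mathcal{A}}^T$ is a nonzero vector.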

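Let $M_0$ be the totality of $\alpha_{\mathrm{add}}$ and $\alpha_{\mathrm{drop}}$ by considering all the possible
combinations of $\mathcal{A}$, $(i_{\mathrm{add}}, j_{\mathrm{add}})$, $(i_{\mathrm{drop}}, j_{\mathrm{drop}})$ and $\mathrm{Sgn}_{\mathcal{A}}$. Clearly $M_0$ is a finite
set and depends only on $X$. Let

(4.27) $\mathcal{N}_0 = \{y : \alpha y = 0 \text{ for some } \alpha \in M_0\}.$

Then on $\mathbb{R}^n \setminus \mathcal{N}_0$ the conclusion holds. □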

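Proof of Lemma 8. Note that $\hat\beta(\lambda)$ is continuous in $\lambda$. Using (4.4) in
Lemma 1 and taking the limit $\lambda \to \lambda_m$, we have

(4.28) $-2x_j^T\Big(y - \sum_{j=1}^{p} x_j\hat\beta(\lambda_m)_j\Big) + \lambda_m\,\mathrm{Sgn}(\hat\beta(\lambda_m)_j) = 0$, for $j \in \mathcal{B}(\lambda_m)$.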

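However, $\sum_{j=1}^{p} x_j\hat\beta(\lambda_m)_j = \sum_{j \in \mathcal{B}(\lambda_m)} x_j\hat\beta(\lambda_m)_j$. Thus we have

(4.29) $\hat\beta(\lambda_m) = (X_{\mathcal{B}(\lambda_m)}^T X_{\mathcal{B}(\lambda_m)})^{-1}\Big(X_{\mathcal{B}(\lambda_m)}^T y - \dfrac{\lambda_m}{2}\,\mathrm{Sgn}(\lambda_m)\Big).$

Hence

(4.30) $\hat\mu(\lambda_m) = H_{\mathcal{B}(\lambda_m)}\,y - \dfrac{\lambda_m}{2}\,X_{\mathcal{B}(\lambda_m)}(X_{\mathcal{B}(\lambda_m)}^T X_{\mathcal{B}(\lambda_m)})^{-1}\,\mathrm{Sgn}(\lambda_m).$

Since $i \in \mathcal{W}_m$, we must have the equiangular condition

(4.31) $\mathrm{Sgn}_i\,x_i^T(y - \hat\mu(\lambda_m)) = \dfrac{\lambda_m}{2}.$

Substituting (4.30) into (4.31), we solve for $\lambda_m/2$ and obtain

(4.32) $\dfrac{\lambda_m}{2} = \dfrac{x_i^T(I - H_{\mathcal{B}(\lambda_m)})\,y}{\mathrm{Sgn}_i - x_i^T X_{\mathcal{B}(\lambda_m)}(X_{\mathcal{B}(\lambda_m)}^T X_{\mathcal{B}(\lambda_m)})^{-1}\,\mathrm{Sgn}(\lambda_m)}.$

Then putting (4.32) back into (4.30) yields (3.2).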
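Using the identity $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, we observe

$\mathrm{tr}(S_m(y) - H_{\mathcal{B}(\lambda_m)}) = \mathrm{tr}\left(-\dfrac{(X_{\mathcal{B}(\lambda_m)}^T X_{\mathcal{B}(\lambda_m)})^{-1}\,\mathrm{Sgn}(\lambda_m)\,x_i^T(I - H_{\mathcal{B}(\lambda_m)})\,X_{\mathcal{B}(\lambda_m)}}{\mathrm{Sgn}_i - x_i^T X_{\mathcal{B}(\lambda_m)}(X_{\mathcal{B}(\lambda_m)}^T X_{\mathcal{B}(\lambda_m)})^{-1}\,\mathrm{Sgn}(\lambda_m)}\right) = \mathrm{tr}(0) = 0,$

because $(I - H_{\mathcal{B}(\lambda_m)})\,X_{\mathcal{B}(\lambda_m)} = 0$.
So $\mathrm{tr}(S_m(y)) = \mathrm{tr}(H_{\mathcal{B}(\lambda_m)}) = |\mathcal{B}(\lambda_m)|$. □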
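
The trace identity at the end of this proof is purely algebraic, so it can be checked on arbitrary inputs. The sketch below is an illustration under stated assumptions: the design matrix is simulated, and the active set, the sign vector and the entering index `i` are hypothetical choices rather than quantities taken from an actual lasso path. It builds the matrix obtained by substituting (4.32) into (4.30), namely $H_{\mathcal{B}}$ minus a rank-one correction, and compares its trace with the size of the active set.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 6
X = rng.standard_normal((n, p))          # simulated design (assumption)

active = [0, 2, 3]                       # hypothetical active set B(lambda_m)
i = 5                                    # hypothetical predictor in W_m
sgn_B = np.array([1.0, -1.0, 1.0])       # hypothetical sign vector Sgn(lambda_m)
sgn_i = -1.0                             # hypothetical sign for predictor i

XB = X[:, active]
G_inv = np.linalg.inv(XB.T @ XB)
H = XB @ G_inv @ XB.T                    # H_B, projection onto span of X_B

u = XB @ G_inv @ sgn_B                   # X_B (X_B^T X_B)^{-1} Sgn(lambda_m)
w = (np.eye(n) - H) @ X[:, i]            # (I - H_B) x_i; I - H_B is symmetric
denom = sgn_i - X[:, i] @ u              # denominator of (4.32)

# S = H_B - u x_i^T (I - H_B) / denom, so that S y reproduces (4.30) with
# lambda_m / 2 replaced by the right-hand side of (4.32).
S = H - np.outer(u, w) / denom

print("tr(S)         :", np.trace(S))
print("|B(lambda_m)| :", len(active))
print("x_i^T (I - H_B) u (should vanish):", w @ u)
```

The rank-one term contributes nothing to the trace because `w` is orthogonal to the column space of `XB`, which contains `u`.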

5. Discussion. In this article we have proven that the number of nonzero 
coefficients is an unbiased estimate of the degrees of freedom of the lasso. 
The unbiased estimator is also consistent. We think it is a neat yet surpris- 
ing result. Even in other sparse modeling methods, there is no such clean 
relationship between the number of nonzero coefficients and the degrees of 
freedom. For example, the number of nonzero coefficients is not an unbiased 
estimate of the degrees of freedom of the elastic net (Zou [26]). Another 
possible counterexample is the SCAD (Fan and Li [6]) whose solution is 
even more complex than the lasso. Note that with orthogonal predictors, 
the SCAD estimates can be obtained by the SCAD shrinkage formula (Fan 
and Li [6]). Then it is not hard to check that with orthogonal predictors the 
number of nonzero coefficients in the SCAD estimates cannot be an unbiased 
estimate of its degrees of freedom. 
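
The SCAD claim can be illustrated with a short sketch, offered as an informal check rather than a proof. It assumes orthonormal predictors, so that each estimate is a scalar thresholding rule applied to a coordinate `z`, and it uses the SCAD thresholding rule of Fan and Li [6] with the conventional value `a = 3.7`. Here `lam` denotes the threshold of the rule itself, not the scaling of $\lambda$ used elsewhere in this paper, and the input vector `z` is an arbitrary illustrative choice. The divergence used by SURE is the sum of the derivatives of the rule. For soft thresholding it equals the number of nonzero estimates, while for SCAD the derivative is $(a-1)/(a-2) > 1$ on the middle segment.

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso rule under an orthonormal design (threshold lam)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule (Fan and Li) under an orthonormal design."""
    az = np.abs(z)
    middle = ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return np.where(az <= 2.0 * lam, soft_threshold(z, lam),
                    np.where(az <= a * lam, middle, z))

def divergence(rule, z, lam, h=1e-6):
    """Sum of d(rule)/dz over coordinates, by central differences."""
    return np.sum((rule(z + h, lam) - rule(z - h, lam)) / (2.0 * h))

lam = 1.0
z = np.array([0.5, 1.5, 3.0, 5.0])       # illustrative inputs away from the kinks

nonzero_soft = np.count_nonzero(soft_threshold(z, lam))
nonzero_scad = np.count_nonzero(scad_threshold(z, lam))
div_soft = divergence(soft_threshold, z, lam)
div_scad = divergence(scad_threshold, z, lam)

print("soft: nonzeros =", nonzero_soft, " divergence =", div_soft)
print("SCAD: nonzeros =", nonzero_scad, " divergence =", div_scad)
```

Whenever a coordinate falls in the middle segment $2\,\mathrm{lam} < |z| \le a\,\mathrm{lam}$, the SCAD divergence exceeds the count of nonzero estimates, so the count has a different expectation from the divergence.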

The techniques developed in this article can be applied to derive the 
degrees of freedom of other nonlinear estimating procedures, especially when 
the estimates have piece-wise linear solution paths. Gunter and Zhu [9] used 
our arguments to derive an unbiased estimate of the degrees of freedom of 
support vector regression. Zhao, Rocha and Yu [25] derived an unbiased 
estimate of the degrees of freedom of the regularized estimates using the 
CAP penalties.

Bühlmann and Yu [2] defined the degrees of freedom of $L_2$ boosting as
the trace of the product of a series of linear smoothers. Their approach
takes advantage of the closed-form expression for the $L_2$ fit at each boosting
stage. It is now well known that $\varepsilon$-$L_2$ boosting is (almost) identical to the
lasso (Hastie, Tibshirani and Friedman [11], Efron et al. [5]). Their work
provides another look at the degrees of freedom of the lasso. However, it
is not clear whether their definition agrees with the SURE definition. This
could be another interesting topic for future research.

Acknowledgments. Hui Zou sincerely thanks Brad Efron, Yuhong Yang 
and Xiaotong Shen for their encouragement and suggestions. We sincerely 
thank the Co-Editor Jianqing Fan, an Associate Editor and two referees for 
helpful comments which greatly improved the manuscript. 

REFERENCES 

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood
principle. In Second International Symposium on Information Theory (B. N.
Petrov and F. Csáki, eds.) 267-281. Akadémiai Kiadó, Budapest. MR0483125

[2] Bühlmann, P. and Yu, B. (2005). Boosting, model selection, lasso and nonnegative
garrote. Technical report, ETH Zürich.

[3] Donoho, D. and Johnstone, I. (1995). Adapting to unknown smoothness via
wavelet shrinkage. J. Amer. Statist. Assoc. 90 1200-1224. MR1379464

[4] Efron, B. (2004). The estimation of prediction error: Covariance penalties and cross- 
validation (with discussion). J. Amer. Statist. Assoc. 99 619-642. MR2090899 

[5] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle 
regression (with discussion). Ann. Statist. 32 407-499. MR2060166 

[6] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and
its oracle properties. J. Amer. Statist. Assoc. 96 1348-1360. MR1946581

[7] Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature
selection in knowledge discovery. In Proc. International Congress of Mathematicians
3 595-622. European Math. Soc., Zürich. MR2275698

[8] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging 
number of parameters. Ann. Statist. 32 928-961. MR2065194 

[9] Gunter, L. and Zhu, J. (2007). Efficient computation and model selection for the 
support vector regression. Neural Computation 19 1633-1655. 

[10] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and
Hall, London. MR1082147

[11] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical
Learning: Data Mining, Inference and Prediction. Springer, New York.
MR1851606

[12] Knight, K. and Fu, W. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 
28 1356-1378. MR1805787 

[13] Mallows, C. (1973). Some comments on Cp. Technometrics 15 661-675. 

[14] Meyer, M. and Woodroofe, M. (2000). On the degrees of freedom in shape- 
restricted regression. Ann. Statist. 28 1083-1104. MR1810920 

[15] Osborne, M., Presnell, B. and Turlach, B. (2000). A new approach to vari- 
able selection in least squares problems. IMA J. Numer. Anal. 20 389-403. 
MR1773265 






[16] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464.
MR0468014

[17] Shao, J. (1997). An asymptotic theory for linear model selection (with discussion). 
Statist. Sinica 7 221-264. MR1466682 

[18] Shen, X. and Huang, H.-C. (2006). Optimal model assessment, selection and combination.
J. Amer. Statist. Assoc. 101 554-568. MR2281243

[19] Shen, X., Huang, H.-C. and Ye, J. (2004). Adaptive model selection and assessment
for exponential family distributions. Technometrics 46 306-317. MR2082500

[20] Shen, X. and Ye, J. (2002). Adaptive model selection. J. Amer. Statist. Assoc. 97 
210-221. MR1947281 

[21] Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. 
Statist. 9 1135-1151. MR0630098 

[22] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy.
Statist. Soc. Ser. B 58 267-288. MR1379242

[23] Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between
model identification and regression estimation. Biometrika 92 937-950.
MR2234196

[24] Ye, J. (1998). On measuring and correcting the effects of data mining and model 
selection. J. Amer. Statist. Assoc. 93 120-131. MR1614596 

[25] Zhao, P., Rocha, G. and Yu, B. (2006). Grouped and hierarchical model selection
through composite absolute penalties. Technical report, Dept. Statistics, Univ.
California, Berkeley.

[26] Zou, H. (2005). Some perspectives of sparse statistical modeling. Ph.D. dissertation. 
Dept. Statistics, Stanford Univ. 



H. Zou 

School of Statistics 
University of Minnesota 
Minneapolis, Minnesota 55455 
USA 

E-MAIL: hzou@stat.umn.edu



T. Hastie 
R. Tibshirani 
Department of Statistics 
Stanford University 
Stanford, California 94305 
USA 

E-MAIL: hastie@stanford.edu 
tibs@stanford.edu 



